Building a song recommender

Dataset used: Million Song Dataset
Source: http://labrosa.ee.columbia.edu/millionsong/
Paper: http://ismir2011.ismir.net/papers/OS6-1.pdf
The current notebook uses a subset of the above data containing 10,000 songs, obtained from: https://github.com/turi-code/tutorials/blob/master/notebooks/recsys_rank_10K_song.ipynb

In [1]:
%matplotlib inline

import pandas
from sklearn.model_selection import train_test_split  #older scikit-learn versions exposed this as sklearn.cross_validation
import numpy as np
import time
import joblib  #older scikit-learn versions bundled this as sklearn.externals.joblib
import Recommenders as Recommenders
import Evaluation as Evaluation

Load music data


In [2]:
#Read userid-songid-listen_count triplets
#This step might take time to download data from external sources
triplets_file = 'https://static.turi.com/datasets/millionsong/10000.txt'
songs_metadata_file = 'https://static.turi.com/datasets/millionsong/song_data.csv'

song_df_1 = pandas.read_table(triplets_file,header=None)
song_df_1.columns = ['user_id', 'song_id', 'listen_count']

#Read song  metadata
song_df_2 =  pandas.read_csv(songs_metadata_file)

#Merge the two dataframes above to create input dataframe for recommender systems
song_df = pandas.merge(song_df_1, song_df_2.drop_duplicates(['song_id']), on="song_id", how="left")

Explore data

The music data shows how many times each user listened to a song, as well as details of the song.


In [3]:
song_df.head()


Out[3]:
user_id song_id listen_count title release artist_name year
0 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOAKIMP12A8C130995 1 The Cove Thicker Than Water Jack Johnson 0
1 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBBMDR12A8C13253B 2 Entre Dos Aguas Flamenco Para Niños Paco De Lucia 1976
2 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBXHDL12A81C204C0 1 Stronger Graduation Kanye West 2007
3 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBYHAJ12A6701BF1D 1 Constellations In Between Dreams Jack Johnson 2005
4 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODACBL12A8C13C273 1 Learn To Fly There Is Nothing Left To Lose Foo Fighters 1999

Length of the dataset


In [4]:
len(song_df)


Out[4]:
2000000

Create a subset of the dataset


In [5]:
#Work with a 10,000-row subset; copy() avoids pandas' SettingWithCopyWarning below
song_df = song_df.head(10000).copy()

#Merge the song title and artist_name columns into a single 'song' column
song_df['song'] = song_df['title'].map(str) + " - " + song_df['artist_name']

In [6]:
song_grouped = song_df.groupby(['song']).agg({'listen_count': 'count'}).reset_index()
grouped_sum = song_grouped['listen_count'].sum()
song_grouped['percentage']  = song_grouped['listen_count'].div(grouped_sum)*100
song_grouped.sort_values(['listen_count', 'song'], ascending = [0,1])


Out[6]:
song listen_count percentage
3660 Sehr kosmisch - Harmonia 45 0.45
4678 Undo - Björk 32 0.32
5105 You're The One - Dwight Yoakam 32 0.32
1071 Dog Days Are Over (Radio Edit) - Florence + Th... 28 0.28
3655 Secrets - OneRepublic 28 0.28
4378 The Scientist - Coldplay 27 0.27
4712 Use Somebody - Kings Of Leon 27 0.27
3476 Revelry - Kings Of Leon 26 0.26
1387 Fireflies - Charttraxx Karaoke 24 0.24
1862 Horn Concerto No. 4 in E flat K495: II. Romanc... 23 0.23
1805 Hey_ Soul Sister - Train 22 0.22
5032 Yellow - Coldplay 22 0.22
808 Clocks - Coldplay 21 0.21
2620 Lucky (Album Version) - Jason Mraz & Colbie Ca... 20 0.20
2299 Just Dance - Lady GaGa / Colby O'Donis 19 0.19
456 Billionaire [feat. Bruno Mars] (Explicit Albu... 18 0.18
2689 Marry Me - Train 18 0.18
3064 OMG - Usher featuring will.i.am 18 0.18
4543 Tive Sim - Cartola 18 0.18
142 Alejandro - Lady GaGa 17 0.17
726 Catch You Baby (Steve Pitron & Max Sanna Radio... 17 0.17
1410 Float On - Modest Mouse 17 0.17
3868 Somebody To Love - Justin Bieber 17 0.17
631 Bulletproof - La Roux 16 0.16
1143 Drop The World - Lil Wayne / Eminem 16 0.16
3038 Nothin' On You [feat. Bruno Mars] (Album Versi... 16 0.16
4465 They Might Follow You - Tiny Vipers 16 0.16
870 Cosmic Love - Florence + The Machine 15 0.15
899 Creep (Explicit) - Radiohead 15 0.15
1680 Halo - Beyoncé 15 0.15
... ... ... ...
5094 You Yourself are Too Serious - The Mercury Pro... 1 0.01
5098 You'll Never Know (My Love) (Bovellian 07 Mix)... 1 0.01
5100 You're A Wolf (Album) - Sea Wolf 1 0.01
5102 You're Gonna Miss Me When I'm Gone - Brooks & ... 1 0.01
5103 You're Not Alone - ATB 1 0.01
5104 You're Not Alone - Olive 1 0.01
5108 You've Passed - Neutral Milk Hotel 1 0.01
5109 Young - Hollywood Undead 1 0.01
5111 Younger Than Springtime - William Tabbert 1 0.01
5112 Your Arms Feel Like home - 3 Doors Down 1 0.01
5113 Your Every Idol - Telefon Tel Aviv 1 0.01
5114 Your Ex-Lover Is Dead (Album Version) - Stars 1 0.01
5115 Your Guardian Angel - The Red Jumpsuit Apparatus 1 0.01
5117 Your House - Jimmy Eat World 1 0.01
5118 Your Love - The Outfield 1 0.01
5121 Your Mouth - Telefon Tel Aviv 1 0.01
5123 Your Song (Alternate Take 10) - Cilla Black 1 0.01
5126 Your Visits Are Getting Shorter - Bloc Party 1 0.01
5127 Your Woman - White Town 1 0.01
5130 Ze Rook Naar Rozen - Rob De Nijs 1 0.01
5131 Zebra - Beach House 1 0.01
5132 Zebra - Man Man 1 0.01
5133 Zero - The Pain Machinery 1 0.01
5135 Zopf: Pigtail - Penguin Café Orchestra 1 0.01
5137 aNYway - Armand Van Helden & A-TRAK Present Du... 1 0.01
5139 high fives - Four Tet 1 0.01
5140 in white rooms - Booka Shade 1 0.01
5143 paranoid android - Christopher O'Riley 1 0.01
5149 ¿Lo Ves? [Piano Y Voz] - Alejandro Sanz 1 0.01
5150 Época - Gotan Project 1 0.01

5151 rows × 3 columns

Count number of unique users in the dataset


In [8]:
users = song_df['user_id'].unique()

In [9]:
len(users)


Out[9]:
365

Quiz 1. Count the number of unique songs in the dataset


In [10]:
###Fill in the code here
songs = song_df['song'].unique()
len(songs)


Out[10]:
5151

Create a song recommender


In [11]:
train_data, test_data = train_test_split(song_df, test_size = 0.20, random_state=0)
print(train_data.head(5))


                                       user_id             song_id  \
7389  94d5bdc37683950e90c56c9b32721edb5d347600  SOXNZOW12AB017F756   
9275  1012ecfd277b96487ed8357d02fa8326b13696a5  SOXHYVQ12AB0187949   
2995  15415fa2745b344bce958967c346f2a89f792f63  SOOSZAZ12A6D4FADF8   
5316  ffadf9297a99945c0513cd87939d91d8b602936b  SOWDJEJ12A8C1339FE   
356   5a905f000fc1ff3df7ca807d57edb608863db05d  SOAMPRJ12A8AE45F38   

      listen_count                 title  \
7389             2      Half Of My Heart   
9275             1  The Beautiful People   
2995             1     Sanctify Yourself   
5316             4     Heart Cooks Brain   
356             20                 Rorol   

                                                release      artist_name  \
7389                                     Battle Studies       John Mayer   
9275             Antichrist Superstar (Ecopac Explicit)   Marilyn Manson   
2995                             Glittering Prize 81/92     Simple Minds   
5316  Everything Is Nice: The Matador Records 10th A...     Modest Mouse   
356                               Identification Parade  Octopus Project   

      year                                   song  
7389     0          Half Of My Heart - John Mayer  
9275     0  The Beautiful People - Marilyn Manson  
2995  1985       Sanctify Yourself - Simple Minds  
5316  1997       Heart Cooks Brain - Modest Mouse  
356   2002                Rorol - Octopus Project  

Simple popularity-based recommender class (Can be used as a black box)


In [ ]:
#Recommenders.popularity_recommender_py
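
The Recommenders module itself is not listed in this notebook. As a rough, illustrative sketch (not the actual implementation), a popularity-based recommender can be written as follows: count how many distinct users listened to each song in the training data, rank songs by that count, and return the same top-10 list for every user. The class and method names mirror the calls used below; the body is an assumption.

class popularity_recommender_py:
    """Illustrative sketch of a popularity-based recommender (not the original Recommenders.py)."""

    def __init__(self):
        self.train_data = None
        self.user_id = None
        self.item_id = None
        self.popularity_recommendations = None

    def create(self, train_data, user_id, item_id):
        self.train_data = train_data
        self.user_id = user_id
        self.item_id = item_id

        #Score each song by the number of distinct users who listened to it
        grouped = (train_data.groupby(item_id)[user_id]
                             .nunique()
                             .reset_index()
                             .rename(columns={user_id: 'score'}))

        #Rank songs by score (highest first) and keep the top 10
        grouped = grouped.sort_values(['score', item_id], ascending=[False, True])
        grouped['Rank'] = range(1, len(grouped) + 1)
        self.popularity_recommendations = grouped.head(10)

    def recommend(self, user_id):
        #Popularity model: every user receives the same top-10 list
        recommendations = self.popularity_recommendations.copy()
        recommendations[self.user_id] = user_id
        return recommendations[[self.user_id, self.item_id, 'score', 'Rank']]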

Create an instance of popularity based recommender class


In [12]:
pm = Recommenders.popularity_recommender_py()
pm.create(train_data, 'user_id', 'song')

Use the popularity model to make some predictions


In [13]:
user_id = users[5]
pm.recommend(user_id)


Out[13]:
user_id song score Rank
3194 4bd88bfb25263a75bbdd467e74018f4ae570e5df Sehr kosmisch - Harmonia 37 1
4083 4bd88bfb25263a75bbdd467e74018f4ae570e5df Undo - Björk 27 2
931 4bd88bfb25263a75bbdd467e74018f4ae570e5df Dog Days Are Over (Radio Edit) - Florence + Th... 24 3
4443 4bd88bfb25263a75bbdd467e74018f4ae570e5df You're The One - Dwight Yoakam 24 4
3034 4bd88bfb25263a75bbdd467e74018f4ae570e5df Revelry - Kings Of Leon 21 5
3189 4bd88bfb25263a75bbdd467e74018f4ae570e5df Secrets - OneRepublic 21 6
4112 4bd88bfb25263a75bbdd467e74018f4ae570e5df Use Somebody - Kings Of Leon 21 7
1207 4bd88bfb25263a75bbdd467e74018f4ae570e5df Fireflies - Charttraxx Karaoke 20 8
1577 4bd88bfb25263a75bbdd467e74018f4ae570e5df Hey_ Soul Sister - Train 19 9
1626 4bd88bfb25263a75bbdd467e74018f4ae570e5df Horn Concerto No. 4 in E flat K495: II. Romanc... 19 10

Quiz 2: Use the popularity-based model to make predictions for the following user id. (Note that the popularity model produces the same recommendations for every user, so only the user_id column changes.)


In [14]:
###Fill in the code here
user_id = users[8]
pm.recommend(user_id)


Out[14]:
user_id song score Rank
3194 9bb911319fbc04f01755814cb5edb21df3d1a336 Sehr kosmisch - Harmonia 37 1
4083 9bb911319fbc04f01755814cb5edb21df3d1a336 Undo - Björk 27 2
931 9bb911319fbc04f01755814cb5edb21df3d1a336 Dog Days Are Over (Radio Edit) - Florence + Th... 24 3
4443 9bb911319fbc04f01755814cb5edb21df3d1a336 You're The One - Dwight Yoakam 24 4
3034 9bb911319fbc04f01755814cb5edb21df3d1a336 Revelry - Kings Of Leon 21 5
3189 9bb911319fbc04f01755814cb5edb21df3d1a336 Secrets - OneRepublic 21 6
4112 9bb911319fbc04f01755814cb5edb21df3d1a336 Use Somebody - Kings Of Leon 21 7
1207 9bb911319fbc04f01755814cb5edb21df3d1a336 Fireflies - Charttraxx Karaoke 20 8
1577 9bb911319fbc04f01755814cb5edb21df3d1a336 Hey_ Soul Sister - Train 19 9
1626 9bb911319fbc04f01755814cb5edb21df3d1a336 Horn Concerto No. 4 in E flat K495: II. Romanc... 19 10

Build a song recommender with personalization

We now create an item similarity based collaborative filtering model that allows us to make personalized recommendations to each user.

Class for an item similarity based personalized recommender system (Can be used as a black box)


In [ ]:
#Recommenders.item_similarity_recommender_py
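
As with the popularity class, the actual item_similarity_recommender_py code is not listed here; the outputs below only show that it builds a co-occurrence matrix and reports its non-zero entries. The sketch that follows is an assumed, simplified version of that idea: it scores each candidate song by its average Jaccard similarity (overlap of listener sets) with the songs the user already listened to, and recommends the highest-scoring candidates. The names match the calls in this notebook, but the details are illustrative, not the real implementation.

import numpy as np
import pandas as pd

class item_similarity_recommender_py:
    """Illustrative sketch of an item-similarity (co-occurrence) recommender (not the original Recommenders.py)."""

    def __init__(self):
        self.train_data = None
        self.user_id = None
        self.item_id = None
        self.item_users = None

    def create(self, train_data, user_id, item_id):
        self.train_data = train_data
        self.user_id = user_id
        self.item_id = item_id
        #Pre-compute, for every song, the set of users who listened to it
        self.item_users = train_data.groupby(item_id)[user_id].apply(set).to_dict()

    def get_user_items(self, user):
        #All songs the given user listened to in the training data
        rows = self.train_data[self.train_data[self.user_id] == user]
        return list(rows[self.item_id].unique())

    def get_similar_items(self, item_list, top_n=10):
        #Average Jaccard similarity between each candidate song and the given songs
        scores = {}
        for candidate, cand_users in self.item_users.items():
            if candidate in item_list:
                continue
            sims = []
            for item in item_list:
                users = self.item_users.get(item, set())
                union = cand_users | users
                sims.append(len(cand_users & users) / len(union) if union else 0.0)
            scores[candidate] = float(np.mean(sims)) if sims else 0.0
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]
        df = pd.DataFrame(ranked, columns=[self.item_id, 'score'])
        df['rank'] = range(1, len(df) + 1)
        return df

    def recommend(self, user):
        #Score every candidate song against the user's listening history
        recs = self.get_similar_items(self.get_user_items(user))
        recs.insert(0, 'user_id', user)
        return recs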

Create an instance of item similarity based recommender class


In [15]:
is_model = Recommenders.item_similarity_recommender_py()
is_model.create(train_data, 'user_id', 'song')

Use the personalized model to make some song recommendations


In [16]:
#Print the songs for the user in training data
user_id = users[5]
user_items = is_model.get_user_items(user_id)
#
print("------------------------------------------------------------------------------------")
print("Training data songs for the user userid: %s:" % user_id)
print("------------------------------------------------------------------------------------")

for user_item in user_items:
    print(user_item)

print("----------------------------------------------------------------------")
print("Recommendation process going on:")
print("----------------------------------------------------------------------")

#Recommend songs for the user using personalized model
is_model.recommend(user_id)


------------------------------------------------------------------------------------
Training data songs for the user userid: 4bd88bfb25263a75bbdd467e74018f4ae570e5df:
------------------------------------------------------------------------------------
Just Lose It - Eminem
Without Me - Eminem
16 Candles - The Crests
Speechless - Lady GaGa
Push It - Salt-N-Pepa
Ghosts 'n' Stuff (Original Instrumental Mix) - Deadmau5
Say My Name - Destiny's Child
My Dad's Gone Crazy - Eminem / Hailie Jade
The Real Slim Shady - Eminem
Somebody To Love - Justin Bieber
Forgive Me - Leona Lewis
Missing You - John Waite
Ya Nada Queda - Kudai
----------------------------------------------------------------------
Recommendation process going on:
----------------------------------------------------------------------
No. of unique songs for the user: 13
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :2097
Out[16]:
user_id song score rank
0 4bd88bfb25263a75bbdd467e74018f4ae570e5df Superman - Eminem / Dina Rae 0.088692 1
1 4bd88bfb25263a75bbdd467e74018f4ae570e5df Mockingbird - Eminem 0.067663 2
2 4bd88bfb25263a75bbdd467e74018f4ae570e5df I'm Back - Eminem 0.065385 3
3 4bd88bfb25263a75bbdd467e74018f4ae570e5df U Smile - Justin Bieber 0.064525 4
4 4bd88bfb25263a75bbdd467e74018f4ae570e5df Here Without You - 3 Doors Down 0.062293 5
5 4bd88bfb25263a75bbdd467e74018f4ae570e5df Hellbound - J-Black & Masta Ace 0.055769 6
6 4bd88bfb25263a75bbdd467e74018f4ae570e5df The Seed (2.0) - The Roots / Cody Chestnutt 0.052564 7
7 4bd88bfb25263a75bbdd467e74018f4ae570e5df I'm The One Who Understands (Edit Version) - War 0.052564 8
8 4bd88bfb25263a75bbdd467e74018f4ae570e5df Falling - Iration 0.052564 9
9 4bd88bfb25263a75bbdd467e74018f4ae570e5df Armed And Ready (2009 Digital Remaster) - The ... 0.052564 10

Quiz 3. Use the personalized model to make recommendations for the following user id. (Note the difference in recommendations from the first user id.)


In [17]:
user_id = users[7]
#Fill in the code here
user_items = is_model.get_user_items(user_id)
#
print("------------------------------------------------------------------------------------")
print("Training data songs for the user userid: %s:" % user_id)
print("------------------------------------------------------------------------------------")

for user_item in user_items:
    print(user_item)

print("----------------------------------------------------------------------")
print("Recommendation process going on:")
print("----------------------------------------------------------------------")

#Recommend songs for the user using personalized model
is_model.recommend(user_id)


------------------------------------------------------------------------------------
Training data songs for the user userid: 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec:
------------------------------------------------------------------------------------
Swallowed In The Sea - Coldplay
Life In Technicolor ii - Coldplay
Life In Technicolor - Coldplay
The Scientist - Coldplay
Trouble - Coldplay
Strawberry Swing - Coldplay
Lost! - Coldplay
Clocks - Coldplay
----------------------------------------------------------------------
Recommendation process going on:
----------------------------------------------------------------------
No. of unique songs for the user: 8
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :3429
Out[17]:
user_id song score rank
0 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec She Just Likes To Fight - Four Tet 0.281579 1
1 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec Warning Sign - Coldplay 0.281579 2
2 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec We Never Change - Coldplay 0.281579 3
3 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec Puppetmad - Puppetmastaz 0.281579 4
4 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec God Put A Smile Upon Your Face - Coldplay 0.281579 5
5 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec Susie Q - Creedence Clearwater Revival 0.281579 6
6 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec The Joker - Fatboy Slim 0.281579 7
7 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec Korg Rhythm Afro - Holy Fuck 0.281579 8
8 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec This Unfolds - Four Tet 0.281579 9
9 9d6f0ead607ac2a6c2460e4d14fb439a146b7dec high fives - Four Tet 0.281579 10

We can also apply the model to find similar songs to any song in the dataset


In [18]:
is_model.get_similar_items(['U Smile - Justin Bieber'])


no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :271
Out[18]:
user_id song score rank
0 Somebody To Love - Justin Bieber 0.428571 1
1 Bad Company - Five Finger Death Punch 0.375000 2
2 Love Me - Justin Bieber 0.333333 3
3 One Time - Justin Bieber 0.333333 4
4 Here Without You - 3 Doors Down 0.333333 5
5 Stuck In The Moment - Justin Bieber 0.333333 6
6 Teach Me How To Dougie - California Swag District 0.333333 7
7 Paper Planes - M.I.A. 0.333333 8
8 Already Gone - Kelly Clarkson 0.333333 9
9 The Funeral (Album Version) - Band Of Horses 0.300000 10

Quiz 4. Use the personalized recommender model to get similar songs for the following song.


In [19]:
song = 'Yellow - Coldplay'
###Fill in the code here
is_model.get_similar_items([song])


no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :969
Out[19]:
user_id song score rank
0 Fix You - Coldplay 0.375000 1
1 Creep (Explicit) - Radiohead 0.291667 2
2 Clocks - Coldplay 0.280000 3
3 Seven Nation Army - The White Stripes 0.250000 4
4 Paper Planes - M.I.A. 0.208333 5
5 Halo - Beyoncé 0.200000 6
6 The Funeral (Album Version) - Band Of Horses 0.181818 7
7 In My Place - Coldplay 0.181818 8
8 Kryptonite - 3 Doors Down 0.166667 9
9 When You Were Young - The Killers 0.166667 10

Quantitative comparison between the models

We now formally compare the popularity and the personalized models using precision-recall curves.
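
For a given user, let R_k be the top-k recommended songs and T the set of songs that user actually has in the test split. The evaluation below assumes the standard definitions of precision and recall at a cutoff k, averaged over the sampled users:

$$\text{precision@}k = \frac{|R_k \cap T|}{k}, \qquad \text{recall@}k = \frac{|R_k \cap T|}{|T|}$$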

Class to calculate precision and recall (This can be used as a black box)


In [20]:
#Evaluation.precision_recall_calculator
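
The Evaluation module is also treated as a black box. Under the definitions above, a simplified sketch of what precision_recall_calculator might do is shown below: sample a fraction of the users that appear in both the training and test splits, ask each model for its top-10 recommendations, and average precision@k and recall@k over the sampled users for k = 1..10. The constructor signature and the return order match the cell below; everything else is an assumption.

import numpy as np

class precision_recall_calculator:
    """Illustrative sketch of a precision/recall evaluator (not the original Evaluation.py)."""

    def __init__(self, test_data, train_data, pm, is_model):
        self.test_data = test_data
        self.train_data = train_data
        self.pm = pm              #popularity model
        self.is_model = is_model  #item-similarity model

    def calculate_measures(self, percentage, k_max=10):
        #Evaluate only users that occur in both the training and the test split
        common_users = np.intersect1d(self.test_data['user_id'].unique(),
                                      self.train_data['user_id'].unique())
        n_sample = max(1, int(len(common_users) * percentage))
        sample = np.random.choice(common_users, n_sample, replace=False)

        pm_precision, pm_recall = np.zeros(k_max), np.zeros(k_max)
        ism_precision, ism_recall = np.zeros(k_max), np.zeros(k_max)

        for user in sample:
            test_songs = set(self.test_data[self.test_data['user_id'] == user]['song'])
            pm_top = list(self.pm.recommend(user)['song'])[:k_max]
            ism_top = list(self.is_model.recommend(user)['song'])[:k_max]
            for k in range(1, k_max + 1):
                pm_hits = len(set(pm_top[:k]) & test_songs)
                ism_hits = len(set(ism_top[:k]) & test_songs)
                pm_precision[k - 1] += pm_hits / k
                ism_precision[k - 1] += ism_hits / k
                if test_songs:
                    pm_recall[k - 1] += pm_hits / len(test_songs)
                    ism_recall[k - 1] += ism_hits / len(test_songs)

        #Average over the sampled users and return the four curves
        return (list(pm_precision / n_sample), list(pm_recall / n_sample),
                list(ism_precision / n_sample), list(ism_recall / n_sample))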

Use the above precision recall calculator class to calculate the evaluation measures


In [20]:
start = time.time()

#Define what percentage of users to use for precision recall calculation
user_sample = 0.05

#Instantiate the precision_recall_calculator class
pr = Evaluation.precision_recall_calculator(test_data, train_data, pm, is_model)

#Call method to calculate precision and recall values
(pm_avg_precision_list, pm_avg_recall_list, ism_avg_precision_list, ism_avg_recall_list) = pr.calculate_measures(user_sample)

end = time.time()
print(end - start)


Length of user_test_and_training:319
Length of user sample:15
Getting recommendations for user:ea3b77e3f9b5688dc3998b2e706ea2c0ca48b8eb
No. of unique songs for the user: 15
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :1742
Getting recommendations for user:78e0065cacc15d6329be91b77045f12ab18cbea5
No. of unique songs for the user: 9
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :215
Getting recommendations for user:5ab56ead71b71022f7043fef70a178b7035629b6
No. of unique songs for the user: 6
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :1896
Getting recommendations for user:881f2e87fe2a45ae27d6e235c156c762ac3cb82a
No. of unique songs for the user: 6
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :1268
Getting recommendations for user:5d5e0142e54c3bb7b69f548c2ee55066c90700eb
No. of unique songs for the user: 31
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :3306
Getting recommendations for user:be0a4b64e9689c46e94b5a9a9c7910ee61aeb16f
No. of unique songs for the user: 76
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :12637
Getting recommendations for user:a8268c552c1122626ba8ab4d7cf2f799de7931b2
No. of unique songs for the user: 23
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :3834
Getting recommendations for user:53ba380d234fd6022818340983570354ee207f6b
No. of unique songs for the user: 10
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :348
Getting recommendations for user:1a849df9dabb15845eb932d46d81e2fd77176786
No. of unique songs for the user: 44
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :6357
Getting recommendations for user:8814f5d1f1d7177aa2efb6de6454504f3bb7b7bc
No. of unique songs for the user: 5
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :741
Getting recommendations for user:d9f2ea75b38f548535caee41d2c0b0e3f9859b1b
No. of unique songs for the user: 8
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :2187
Getting recommendations for user:f608c215606e6421a429ea28ad08243241d5347d
No. of unique songs for the user: 27
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :2286
Getting recommendations for user:a54543f7282b66b3c8423181bf2789e1c7eb2edc
No. of unique songs for the user: 10
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :1014
Getting recommendations for user:ea64e003562d2f0f39e5a7dd84af5b1969e0fea3
No. of unique songs for the user: 10
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :1151
Getting recommendations for user:95b2ebf54cd69d732fa433ee8994be5818793efb
No. of unique songs for the user: 12
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :646
61.7630341053009

Code to plot precision recall curve


In [21]:
import pylab as pl

#Method to generate precision and recall curve
def plot_precision_recall(m1_precision_list, m1_recall_list, m1_label, m2_precision_list, m2_recall_list, m2_label):
    pl.clf()    
    pl.plot(m1_recall_list, m1_precision_list, label=m1_label)
    pl.plot(m2_recall_list, m2_precision_list, label=m2_label)
    pl.xlabel('Recall')
    pl.ylabel('Precision')
    pl.ylim([0.0, 0.20])
    pl.xlim([0.0, 0.20])
    pl.title('Precision-Recall curve')
    #pl.legend(loc="upper right")
    pl.legend(loc=9, bbox_to_anchor=(0.5, -0.2))
    pl.show()

In [22]:
print("Plotting precision recall curves.")

plot_precision_recall(pm_avg_precision_list, pm_avg_recall_list, "popularity_model",
                      ism_avg_precision_list, ism_avg_recall_list, "item_similarity_model")


Plotting precision recall curves.

Generate the precision-recall curve using pickled results on a larger data subset (Python 3)


In [23]:
print("Plotting precision recall curves for a larger subset of data (100,000 rows) (user sample = 0.005).")

#Read the persisted files 
pm_avg_precision_list = joblib.load('pm_avg_precision_list_3.pkl')
pm_avg_recall_list = joblib.load('pm_avg_recall_list_3.pkl')
ism_avg_precision_list = joblib.load('ism_avg_precision_list_3.pkl')
ism_avg_recall_list = joblib.load('ism_avg_recall_list_3.pkl')

print("Plotting precision recall curves.")
plot_precision_recall(pm_avg_precision_list, pm_avg_recall_list, "popularity_model",
                      ism_avg_precision_list, ism_avg_recall_list, "item_similarity_model")


Plotting precision recall curves for a larger subset of data (100,000 rows) (user sample = 0.005).
Plotting precision recall curves.

Generate the precision-recall curve using pickled results on a larger data subset (Python 2.7)


In [24]:
print("Plotting precision recall curves for a larger subset of data (100,000 rows) (user sample = 0.005).")

pm_avg_precision_list = joblib.load('pm_avg_precision_list_2.pkl')
pm_avg_recall_list = joblib.load('pm_avg_recall_list_2.pkl')
ism_avg_precision_list = joblib.load('ism_avg_precision_list_2.pkl')
ism_avg_recall_list = joblib.load('ism_avg_recall_list_2.pkl')

print("Plotting precision recall curves.")
plot_precision_recall(pm_avg_precision_list, pm_avg_recall_list, "popularity_model",
                      ism_avg_precision_list, ism_avg_recall_list, "item_similarity_model")


Plotting precision recall curves for a larger subset of data (100,000 rows) (user sample = 0.005).
Plotting precision recall curves.

The curves show that the personalized model performs much better than the popularity model.

Matrix Factorization based Recommender System

Using an SVD matrix factorization based collaborative filtering recommender system

The following code implements a Singular Value Decomposition (SVD) based matrix factorization collaborative filtering recommender system. The user ratings matrix used is the following small matrix:

        Item0  Item1  Item2  Item3
User0     3      1      2      3
User1     4      3      4      3
User2     3      2      1      5
User3     1      6      5      2
User4     0      0      5      0

As the matrix shows, all users except user 4 rate all items. The code calculates predicted recommendations for user 4.
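
As a brief refresher (standard SVD background, not something computed elsewhere in this notebook), a rank-K SVD approximates the user ratings matrix R as

$$R \approx U_K \, \Sigma_K \, V_K^{T},$$

where the rows of U_K are user vectors, the columns of V_K^T are item vectors, and Σ_K holds the K largest singular values. The estimated preference of user u for item i is read off from the (u, i) entry of this product. The code below follows this idea with K = 2, ranking the items for the test user by the resulting scores.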

Import the required libraries


In [25]:
#Code source written with help from: 
#http://antoinevastel.github.io/machine%20learning/python/2016/02/14/svd-recommender-system.html

import math as mt
import csv
from sparsesvd import sparsesvd #used for matrix factorization
import numpy as np
from scipy.sparse import csc_matrix #used for sparse matrix
from scipy.sparse.linalg import * #used for matrix multiplication

#Note: You may need to install the library sparsesvd. Documentation for 
#sparsesvd method can be found here:
#https://pypi.python.org/pypi/sparsesvd/

Methods to compute SVD and recommendations


In [26]:
#constants defining the dimensions of our User Rating Matrix (URM)
MAX_PID = 4
MAX_UID = 5

#Compute SVD of the user ratings matrix
def computeSVD(urm, K):
    U, s, Vt = sparsesvd(urm, K)

    dim = (len(s), len(s))
    S = np.zeros(dim, dtype=np.float32)
    for i in range(0, len(s)):
        S[i,i] = mt.sqrt(s[i])

    U = csc_matrix(np.transpose(U), dtype=np.float32)
    S = csc_matrix(S, dtype=np.float32)
    Vt = csc_matrix(Vt, dtype=np.float32)
    
    return U, S, Vt

#Compute estimated rating for the test user
def computeEstimatedRatings(urm, U, S, Vt, uTest, K, test):
    rightTerm = S*Vt 

    estimatedRatings = np.zeros(shape=(MAX_UID, MAX_PID), dtype=np.float16)
    for userTest in uTest:
        prod = U[userTest, :]*rightTerm
        #we convert the vector to dense format in order to get the indices 
        #of the movies with the best estimated ratings 
        estimatedRatings[userTest, :] = prod.todense()
        recom = (-estimatedRatings[userTest, :]).argsort()[:250]
    return recom

Use SVD to make predictions for a test user id, say 4


In [27]:
#Used in SVD calculation (number of latent factors)
K=2

#Initialize a sample user rating matrix
urm = np.array([[3, 1, 2, 3],[4, 3, 4, 3],[3, 2, 1, 5], [1, 6, 5, 2], [5, 0,0 , 0]])
urm = csc_matrix(urm, dtype=np.float32)

#Compute SVD of the input user ratings matrix
U, S, Vt = computeSVD(urm, K)

#Test user set as user_id 4
uTest = [4]
print("User id for whom recommendations are needed: %d" % uTest[0])

#Get estimated rating for test user
print("Predictied ratings:")
uTest_recommended_items = computeEstimatedRatings(urm, U, S, Vt, uTest, K, True)
print(uTest_recommended_items)


User id for whom recommendations are needed: 4
Predicted ratings:
[0 3 2 1]

Quiz 5

a.) Change the input matrix row for test user id 4 in the user ratings matrix to the following value, and note the difference in the predicted recommendations:

i.) [5 0 0 0]

(Note: the predicted ratings returned by the code include items the test user has already rated; this has been left in on purpose to make the SVD output easier to interpret.)

SVD tutorial: http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm

Understanding Intuition behind SVD

The SVD produces three matrices as output: U, S and Vt (the t in Vt denotes transpose). The matrix U contains the user vectors and the matrix Vt contains the item vectors. In simple terms, U represents the users as 2-dimensional points in the latent vector space, and Vt represents the items as 2-dimensional points in the same space.
Next, we print the matrices U, S and Vt and try to interpret them. Think about how the points for the users and the items would look on a 2-dimensional plot. For example, the following code plots all the user vectors from the matrix U in this 2-dimensional space, and then plots all the item vectors from the matrix Vt in the same figure.

In [28]:
%matplotlib inline
from pylab import *

#Plot all the users
print("Matrix Dimensions for U")
print(U.shape)

for i in range(0, U.shape[0]):
    plot(U[i,0], U[i,1], marker = "*", label="user"+str(i))

for j in range(0, Vt.T.shape[0]):
    plot(Vt.T[j,0], Vt.T[j,1], marker = 'd', label="item"+str(j))    
    
legend(loc="upper right")
title('User vectors in the Latent semantic space')
ylim([-0.7, 0.7])
xlim([-0.7, 0])
show()


Matrix Dimensions for U
(5, 2)